Abstract: One main challenge in the design of distributed 
storage codes is the Exact Repair Problem: if a node storing 
encoded information fails, to maintain the same level of reliability, 
we need to exactly regenerate what was lost in a new node. A 
major open problem in this area has been the design of codes 
that i) admit exact and low cost repair of nodes and ii) have 
arbitrarily high data rates. 

In this paper, we are interested in the metric of repair 
locality, which corresponds to the number of disk accesses 
required during a node repair. Under this metric we characterize 
an information theoretic trade-off that binds together locality, 
code distance, and storage cost per node. We introduce locally 
repairable codes (LRCs), which are shown to achieve this trade-off. 
The achievability proof uses a "locality aware" flow graph gadget 
which leads to a randomized code construction. We then present 
the first explicit construction of LRCs that can achieve arbitrarily 
high data-rates. 

I. Introduction 

Distributed and cloud storage systems have reached such 
a massive scale that recovery from failures is now part of 
regular operation rather than a rare exception. These large 
scale storage systems have to allow for high data availability 
and be able to tolerate multiple physical node failures to 
prevent data loss. These systems can achieve the targeted 
data availability and reliability requirements by introducing 
redundancy among the stored bits. Erasure coded storage 
systems achieve high reliability without requiring the increased 
storage cost that is associated with data replication [5]. Three 
application contexts where erasure coding techniques are 
currently deployed or under investigation are cloud storage 
systems like Facebook's Hadoop cluster, archival storage, and 
peer-to-peer storage systems like Cleversafe and Wuala. 

A central issue that arises in coded storage is the Repair 
Problem: how to maintain the encoded representation when 
failures (node erasures) occur. To maintain the same redundancy 
when a storage node leaves the system, a newcomer 
node has to join the array, access some existing nodes, and 
exactly reproduce the lost contents. During this repair process, 
there are several metrics that can be optimized: the total 
information read from existing disks, the total number of 
bits communicated in the network [7]-[12] (called repair 
bandwidth [2]), or the total number of disks required for each 
repair. Currently, the most well-understood metric 
is repair bandwidth, which was characterized in [2]. A great 
variety of asymptotic and explicit repair bandwidth optimal 
code constructions were introduced in [2], [7]-[12]. 

It seems that for cloud storage applications, the main repair 
performance bottleneck is the disk I/O overhead. The disk I/O 
is proportional to the number of nodes r involved in the repair 
process of a failed node. This number r defines our metric of 
interest: repair locality. 

Repair locality was identified as a good metric for repair 
cost independently by Gopalan et al. [14], Oggier et al. [6], 
and Papailiopoulos et al. [13]. Codes that have good locality 
properties were studied in [13]-[15], [18]. In [14], a 
trade-off between locality and code distance, i.e., reliability, 
was defined for scalar linear codes. However, up to now 
there do not exist explicit and high rate codes optimized for 
locality and there is no universal approach (for both linear and 
nonlinear codes) that characterizes the information theoretic 
limits between repair locality r, code distance d, and storage 
per node $\alpha$. In this paper we address this open problem. 

Our Contribution: Consider a file of size $M$ that is cut into 
k pieces, which are encoded into $n > k$ elements of size 
$\alpha = (1 + \epsilon)\frac{M}{k}$. We establish an information theoretic tradeoff 
between the repair locality r, the code distance d, and the 
amount of storage spent per node $\alpha$, for storage codes of length 
n. We derive our bounds using a characterization of the code 
distance d in terms of entropy. A new information flow graph is 
fundamental to our derivations. Using random linear network 
coding (RLNC) arguments on this flow graph [16], we show 
that linear vector codes suffice to achieve the trade-off. We 
call these optimal codes locally repairable codes. 

Then, we focus on the operational point where any k coded 
elements can recover the file, i.e., $d = n - k + 1$. We construct 
the first explicit family of LRCs that have locality r at the cost 
of an excess storage overhead of $\epsilon = \frac{1}{r}$. This cost can be made 
asymptotically (in n, k) negligible when r is any sub-linear 
function of k such as $r = \log(k)$ or $r = \sqrt{k}$. Our designs are 
vector linear, work for any n, k, r, and require a finite field of 
order n. A general LRC construction for any feasible point of 
the tradeoff is left as an interesting open problem. 

II. Repair Locality vs. Code Distance vs. Storage 
Capacity 

A. Storage Code Distance Through Entropy 

In the following, we see how we can use entropy on the 
coded elements of a storage code to make arguments on 
the code distance. This way, we aim to establish a universal 
information theoretic tradeoff of (linear or nonlinear) codes 
that binds together the metric of locality, the code distance, 
and the storage capacity spent for each coded element or node. 
We would like to note that in many points in this work, we 
use the phrase "coded element" instead of "node". 

Let a file of size $M$ be cut in k equally sized pieces 

$x = [X_1 \dots X_k]$, (1) 

where the $X_i$ can be viewed as k source elements over some finite 
field $\mathbb{F}$ that are i.i.d. random variables each having entropy 
$H(X_i) = \frac{M}{k}$, for all $i \in [k]$, where $[N]$ denotes the set of 
integers $\{1, \dots, N\}$. Moreover, let an encoding (generator) 
function $G : \mathbb{F}^{1 \times k} \mapsto \mathbb{F}^{1 \times n}$ take as input the k elements 
and output n coded elements 

$G(x) = y = [Y_1 \dots Y_n]$, (2) 

where each encoded element (which can be also seen as a 
random variable) has entropy 

$H(Y_i) = \alpha \ge \frac{M}{k}$, (3) 

for all $i \in [n]$. The generator function G defines a code C. The 
rate of the code is the ratio of the aggregate useful information 
to the aggregate stored information, i.e., the entropy of the 
source elements to the sum of the entropy of each encoded 
element 

$R = \frac{H(X_1, \dots, X_k)}{\sum_{i=1}^{n} H(Y_i)} \le \frac{k}{n}$, (4) 

with equality when $\alpha = \frac{M}{k}$. 

Definition 1 (Minimum Code Distance): The minimum 
distance d of the code C is equal to the minimum number 
of erasures of elements in y after which the entropy of the 
non-erased variables is strictly less than M, that is, 

$d = \min |\mathcal{E}|$, (5) 

such that $H(\{Y_1, \dots, Y_n\} \setminus \mathcal{E}) < M$ and $\mathcal{E} \in 2^{\{Y_1, \dots, Y_n\}}$, 
where $2^{\{Y_1, \dots, Y_n\}}$ is the power set of the elements in 
$\{Y_1, \dots, Y_n\}$. 

In other words, a code has minimum distance d when there 
is "enough" entropy after any $d - 1$ coded element erasures 
to reconstruct the file. The above definition can be restated 
in its "dual" form: the minimum distance d of the code C is 
equal to n minus the maximum number of non-erased coded 
elements in y that cannot reconstruct the file, that is, 
$d = n - \max_{H(\mathcal{S}) < M} |\mathcal{S}|$, where $\mathcal{S} \in 2^{\{Y_1, \dots, Y_n\}}$. 
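
For a scalar linear code with independent, uniform source symbols, the entropy of a set of coded elements is proportional to the rank of the corresponding generator columns, so the dual form of Definition 1 can be checked by brute force on small codes. The following is a minimal sketch under that assumption, restricted to binary codes whose generator columns are given as integer bitmasks. The helper names `gf2_rank` and `min_distance` and the single-parity example are illustrative and do not come from the paper.

```python
from itertools import combinations

def gf2_rank(vectors):
    """Rank over GF(2) of vectors encoded as integer bitmasks."""
    basis = []  # reduced vectors with distinct leading bits, largest first
    for v in vectors:
        for b in basis:
            v = min(v, v ^ b)  # clear b's leading bit from v if it is set
        if v:
            basis.append(v)
            basis.sort(reverse=True)
    return len(basis)

def min_distance(columns, k):
    """d = n - max{|S| : rank of the columns in S is below k}."""
    n = len(columns)
    largest_deficient = 0
    for size in range(1, n + 1):
        if any(gf2_rank(sub) < k for sub in combinations(columns, size)):
            largest_deficient = size
    return n - largest_deficient

# (3, 2) single-parity code: Y1 = X1, Y2 = X2, Y3 = X1 + X2.
parity_code = [0b01, 0b10, 0b11]
print(min_distance(parity_code, k=2))  # any 2 columns have full rank, so d = 3 - 1 = 2
```

The search is exponential in n, so it is only meant for sanity checks on toy examples.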

Remark 1: Observe that the above distance definition is 
universal in the sense that it applies to linear or nonlinear codes 
and is oblivious to any type of element subpacketization. 

We continue by explicitly defining repair locality. 

Definition 2 (Repair Locality): A coded element $Y_i$, $i \in [n]$, 
has repair locality r if it is a function of r other coded 
variables, $Y_i = f_i(Y_{\mathcal{R}(i)})$. The set $\mathcal{R}(i)$ indexes the smallest 
set of r coded elements that can reconstruct $Y_i$, and $f_i$ is some 
function on these r coded elements. 

In [14], Gopalan et al. show that for length n scalar linear 
codes, where G is a linear function on x and each coded element 
$Y_i$, $i \in [n]$, has entropy $\alpha = \frac{M}{k}$ and locality r, the 
minimum code distance is bounded as 

$d \le n - k - \left\lceil \frac{k}{r} \right\rceil + 2$. (6) 

Observe that according to the above bound, low locality 
$r \ll k$ penalizes the minimum distance by a term 
of $\lceil k/r \rceil$. This distance, or reliability, penalty cannot be avoided 
for scalar codes. On the other hand, maximum reliability, i.e., 
$d = n - k + 1$, costs in locality. Indeed an (n, k)-Maximum-Distance 
Separable (MDS) code has both the maximum possible 
distance $n - k + 1$ and the worst possible locality r = k. 
However, for our purposes we would like locality to be low: 
either a constant, or a sub-linear function of k. 

In the following, we derive an information theoretic tradeoff 
between locality r, distance d, and storage per node $\alpha$. We see 
how the third parameter $\alpha$ can be used to define operational 
points of high distance and low locality. We will refer to codes 
that achieve this tradeoff as $(n, k, r, d, \alpha)$-LRCs. 

We will eventually present explicit LRCs for any n, k, r 
that have the "(n, k) erasure property", i.e., that any set of k 
coded elements has entropy at least M, which is equivalent to 
requiring distance $d = n - k + 1$. This operational point will 
require storage $\alpha = \left(1 + \frac{1}{r}\right)\frac{M}{k}$. 

B. An information theoretic $(r, d, \alpha)$ tradeoff 

In the following we determine the information theoretic 
minimum distance of a code, where each coded element has 
entropy $\alpha$ and repair locality r. We do that by an algorithmic 
proof in the same manner as [14] that bounds the distance for 
any possible code C. We give a lower bound, over all codes, 
on the size of the largest set $\mathcal{S}$ of coded elements whose entropy is less 
than M. To simplify calculations, we denote the storage of 
each node as $\alpha = (1 + \epsilon)\frac{M}{k}$, where $\epsilon \ge 0$. 

The only structural property of a code that we use in our 
proof is the fact that there exist (r + 1)-sized repair groups. 
For a code of length n and locality r, and for each of its coded 
elements, say $Y_i$, there exist at most r other coded elements 
$Y_{\mathcal{R}(i)}$ that can reconstruct $Y_i$, for $i \in [n]$. Then, the coded 
elements indexed by $\Gamma(i) = \{i, \mathcal{R}(i)\}$ form an (r + 1)-group 
that has the property 

$H(Y_{\Gamma(i)}) = H(Y_i, Y_{\mathcal{R}(i)}) = H(Y_{\mathcal{R}(i)}) \le r\alpha$, (7) 

for all $i \in [n]$, due to the functional dependencies induced 
by the locality property. To determine the upper bound on the 
minimum distance of a $C(n, r, d, \alpha)$ code, we construct the 
maximum set of nodes, or coded elements, $\mathcal{S}$ that have entropy 
less than M. 

Theorem 1: For a code $C(n, r, d, \alpha)$, the minimum distance 
is bounded as 

$d \le n - \left\lceil \frac{M}{\alpha} \right\rceil - \left\lceil \frac{M}{r\alpha} \right\rceil + 2$. (8) 

Proof: The proof can be found in the Appendix of the 
full version of the manuscript [17]. ■ 



Remark 2: We would like to note that the main difference 
of our proving technique compared to the one in [14] is 
that it involves counting arguments on information flows, or 
entropies, instead of ranks of matrices. 

Corollary 1: In terms of the code distance, non-overlapping 
(r + 1)-groups are optimal. 

Observe that if we set $\epsilon = 0$ in the above bound, we get 
the same bound as [14]. This means that the bound derived 
in [14] applies to nonlinear codes as well. 
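
To make the role of the storage overhead concrete, the bound can be evaluated numerically. The sketch below assumes the bound has the form $d \le n - \lceil k/(1+\epsilon) \rceil - \lceil k/(r(1+\epsilon)) \rceil + 2$, which follows from substituting $\alpha = (1+\epsilon)\frac{M}{k}$; the function name `lrc_distance_bound` and the parameter values are illustrative only. Exact rational arithmetic is used so that the ceilings are not affected by floating point rounding.

```python
from fractions import Fraction
from math import ceil

def lrc_distance_bound(n, k, r, eps=Fraction(0)):
    """Upper bound n - ceil(k/(1+eps)) - ceil(k/(r(1+eps))) + 2 on the distance."""
    scale = 1 + Fraction(eps)
    return n - ceil(Fraction(k) / scale) - ceil(Fraction(k) / (r * scale)) + 2

n, k, r = 6, 4, 2
print(lrc_distance_bound(n, k, r))                  # eps = 0: 6 - 4 - 2 + 2 = 2
print(lrc_distance_bound(n, k, r, Fraction(1, r)))  # eps = 1/2: 6 - 3 - 2 + 2 = 3
print(lrc_distance_bound(n, k, k))                  # r = k, eps = 0: 6 - 4 - 1 + 2 = 3
```

In this toy instance, locality 2 without extra storage costs one unit of distance relative to the MDS value n - k + 1 = 3, and spending $\epsilon = 1/r$ extra storage recovers it.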



In the following, for simplicity we will assume that $(r+1) \mid n$ 
and then prove that the above bound is achievable using information 
flows and random linear network coding techniques. 

III. Achievability of the Bound: Random LRCs 

In this section, we show that the bound of Theorem 1 is 
achievable using a random linear network coding (RLNC) 
scheme [16]. In our proof, we use a variant of the information 
flow graph of [2] and show that when the $(n, k, r, d, \alpha)$ 
parameters of an LRC agree with the bound in Theorem 1, 
then the min-cut of this flow graph is large enough for a 
specific multicast session to be feasible. The feasibility of this 
multicasting problem is shown to be equivalent to the existence 
of $(n, k, r, d, \alpha)$-LRCs. More precisely, we have the following 
theorem, which we prove in this section. 

Theorem 2: Let $(r + 1) \mid n$. Then, there exist vector linear 
$(n, k, r, d, \alpha)$-LRCs over $\mathbb{F}_{2^n}$ that have minimum distance 
$d = n - \left\lceil \frac{M}{\alpha} \right\rceil - \left\lceil \frac{M}{r\alpha} \right\rceil + 2$. 

In the same manner as [2], the information flow graph is a 
directed network, where the k input elements correspond to k 
sources, the n coded elements are represented as intermediate 
nodes, and the sinks of the network are what we call Data 
Collectors (DCs), each of which is required to decode all k 
source elements. The specifications of this network, such as the 
number and degree of nodes, the edge capacities, and the cut-set 
bounds, are determined by the $(n, k, r, d, \alpha)$ parameters. 
Here, in contrast to the work in [2], we need to account for 
the locality properties of the code. By incorporating a subgraph 
that accounts for the dependencies among the coded elements 
of a repair group, we obtain our "locality aware" flow graph. 

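In Fig. 1 we show the general flow graph that we use in 
our proof. The network that is defined by the flow graph has 
k sources and $N = \binom{n}{n-d+1}$ sinks (DCs). We refer to this 
directed graph as $\mathcal{G}(n, k, r, d, \alpha)$ with vertex set 

$\mathcal{V} = \{\{X_i; i \in [k]\}, \{\Gamma_j^{in}, \Gamma_j^{out}; j \in [\frac{n}{r+1}]\}, \{Y_j^{in}, Y_j^{out}; j \in [n]\}, \{DC_l; l \in [N]\}\}$. 

The directed edge set is implied by the following edge capacity 
function 

$c_e(v, u) = \begin{cases} 
\infty, & (v, u) \in (\{X_i; i \in [k]\}, \{\Gamma_j^{in}; j \in [\frac{n}{r+1}]\}) \cup (\{\Gamma_j^{out}; j \in [\frac{n}{r+1}]\}, \{Y_j^{in}; j \in [n]\}) \cup (\{Y_j^{out}; j \in [n]\}, \{DC_l; l \in [N]\}), \\ 
r\alpha, & (v, u) \in (\{\Gamma_j^{in}; j \in [\frac{n}{r+1}]\}, \{\Gamma_j^{out}; j \in [\frac{n}{r+1}]\}), \\ 
\alpha, & (v, u) \in (\{Y_j^{in}; j \in [n]\}, \{Y_j^{out}; j \in [n]\}), \\ 
0, & \text{otherwise.} 
\end{cases}$ 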



The vertices $\{X_i; i \in [k]\}$ correspond to the k file elements 
and $\{Y_j^{out}; j \in [n]\}$ correspond to the coded elements. The 
edge capacity between the in- and out- $Y_i$ vertices corresponds 
to the entropy of a single coded block. The fundamental 
difference with the flow graph of [2] is the additional flow 
constraints invoked by repair locality assumptions: coded 
elements (nodes) in an (r + 1)-group have joint flow (entropy) 
at most $r\alpha$, instead of $(r + 1)\alpha$. To enforce this constraint, we 
bottleneck the in-flow of each group by a node that restricts 
it to be at most $r\alpha$. In Fig. 2 we show the part of the flow 
graph that enforces the "bottleneck" induced by a repair group, 
where we consider the first group, without loss of generality. 

Fig. 1. We sketch the directed acyclic information flow graph 
$\mathcal{G}(n, k, r, d, \alpha)$. The black vertices correspond to the k sources, each of which 
has entropy $\frac{M}{k}$. The white vertices correspond to nodes that bottleneck 
the in-flow within a repair group. The blue vertices correspond to the n nodes, 
or coded elements, and the yellow vertices are the DCs (sinks) of the network. 

For a group $\Gamma(i)$, $i \in [\frac{n}{r+1}]$, we add node $\Gamma_i^{in}$ that receives 
flow from the sources and is connected with an edge of 
capacity $r\alpha$ to a new node $\Gamma_i^{out}$. The latter connects to the r + 1 
elements of the i-th group. We should note that when we are 
considering a specific group, it is implied that any block within 
that group can be repaired from the remaining r elements. 
When a block is lost, the functional dependence among the 
elements of that group allows a newcomer block to compute 
a function on the remaining r elements and reconstruct what 
was lost. 

Fig. 2. The flow bottlenecks of an (r + 1)-group of elements. Observe 
that when an element is lost, a new node can join the system and download 
everything from the remaining nodes in its group. 

Linear combinations of the file elements travel along the 
edges of this graph towards the sinks, which we call Data 
Collectors (DCs). A DC needs to connect to as many coded 
elements as needed so that it can reconstruct the file. This is 
equivalent to requiring source-to-sink (s-t) cuts between the 
file elements and the DCs that are at least equal to M, i.e., the 
file size. An s-t cut in $\mathcal{G}(n, k, r, d, \alpha)$ determines the amount 
of flow, or entropy, that can travel from the source elements 
to the destinations. When d is consistent with the bound of 
Theorem 1, then the minimum of all the cuts is at least as 
much as the file size M. 

Lemma 1: The minimum source-DC cut in $\mathcal{G}(n, k, r, d, \alpha)$ 
is larger than or equal to M when d is consistent with Theorem 1. 

Proof: The proof can be found in the Appendix of the 
full version of the manuscript [17]. ■ 
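
The statement of Lemma 1 can be checked by brute force on a small instance. The sketch below is not the paper's proof. It assumes non-overlapping groups with $(r+1) \mid n$ and that each DC reads n - d + 1 coded elements, and it uses the shortcut that a group contributes at most $r\alpha$ through its bottleneck and at most $\alpha$ per element that the DC reads. The function name `min_source_dc_cut` and the parameter values are illustrative.

```python
from fractions import Fraction
from itertools import combinations
from math import ceil

def min_source_dc_cut(n, r, alpha, d):
    """Smallest max-flow over all DCs, each reading n - d + 1 coded elements."""
    groups = [set(range(g, g + r + 1)) for g in range(0, n, r + 1)]
    best = None
    for chosen in combinations(range(n), n - d + 1):
        picked = set(chosen)
        # Per group: at most r*alpha through the bottleneck edge, and at most
        # alpha for every coded element of the group that the DC connects to.
        flow = sum(min(r * alpha, alpha * len(picked & g)) for g in groups)
        best = flow if best is None else min(best, flow)
    return best

n, k, r = 6, 4, 2
M = Fraction(k)                        # unit-entropy source symbols (assumption)
alpha = (1 + Fraction(1, r)) * M / k   # storage per node with eps = 1/r
d_bound = n - ceil(M / alpha) - ceil(M / (r * alpha)) + 2
print(d_bound, min_source_dc_cut(n, r, alpha, d_bound) >= M)  # 3 True
print(min_source_dc_cut(n, r, alpha, d_bound + 1) >= M)       # False
```

In this instance the minimum cut drops below the file size as soon as d exceeds the bound, because a DC can then read a single complete repair group.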
Then, a successful multicast session on $\mathcal{G}(n, k, r, d, \alpha)$ can be 
interpreted as, and is equivalent to, all DCs decoding the file, 
i.e., all k source elements can be reconstructed at each sink 
using the received linear combinations. Interestingly, the linear 
combinations of the elements along the edges between the n 
node couples $(Y_j^{in}, Y_j^{out})$ are exactly the coding coefficients 
that need to be used by the n coded elements to achieve 
distance d. Hence, we will use the following lemma to prove 
the existence of codes. 

Lemma 2 (RLNC): For a network with k sources and N 
destinations where $\eta$ links transmit linear combinations of 
inputs, the probability of success is at least $\left(1 - \frac{N}{q}\right)^{\eta}$. 

We can now combine the above with the fact that there exists 
an RLNC that succeeds when $q > N$, i.e., when 
$q > \binom{n}{n-d+1} = \binom{n}{\lceil M/\alpha \rceil + \lceil M/(r\alpha) \rceil - 1}$, to obtain Theorem 2. For simplicity we used 
the upper bound $2^n$ for the binomial coefficient $\binom{n}{n-d+1}$. 
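
The field size argument can be illustrated numerically. The sketch below assumes that there is one DC per subset of n - d + 1 coded elements and that the RLNC bound of [16] takes the form $(1 - N/q)^{\eta}$; the function name `rlnc_success_lower_bound` and the link count `eta = 10` are made up for illustration.

```python
from math import comb

def rlnc_success_lower_bound(num_sinks, q, eta):
    """Lower bound (1 - N/q)^eta on the RLNC success probability, for q > N."""
    if q <= num_sinks:
        raise ValueError("the bound needs a field size q larger than the number of sinks")
    return (1 - num_sinks / q) ** eta

n, d = 6, 3
N = comb(n, n - d + 1)   # one DC per (n - d + 1)-subset of coded elements
q = 2 ** n               # field size 2^n, which always exceeds the binomial coefficient
print(N, q, N < q)       # 15 64 True
print(rlnc_success_lower_bound(N, q, eta=10) > 0)  # eta = 10 is a hypothetical link count
```

A strictly positive lower bound is all the existence argument needs: some choice of coefficients must then succeed.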

IV. Explicit LRC Constructions 
A. The $d = n - k + 1$ operational point 

In this section, we study the operational point of $d = n - k + 1$. 
We calculate the minimum storage overhead that allows the 
"any k property" and we construct explicit LRCs. Our codes 
are based on existing MDS codes, like Reed-Solomon (RS) 
codes, and the finite field order that we require is $q \ge n$. 

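We first solve for the storage overhead that is required to 
have distance $n - k + 1$. This overhead $\epsilon$ is the minimum one 
that satisfies the equation $d = n - \left\lceil \frac{k}{1+\epsilon} \right\rceil - \left\lceil \frac{k}{r(1+\epsilon)} \right\rceil + 2$, where 
$d = n - k + 1$. Therefore, the minimum storage overhead 
for erasure distance can be found through the following 
optimization 

$\min_{\epsilon} \; \epsilon$ 
subject to: $\left\lceil \frac{k}{1+\epsilon} \right\rceil + \left\lceil \frac{k}{r(1+\epsilon)} \right\rceil = k + 1$, (9) 
$\epsilon \ge 0$. 

Due to the ceiling function in the above constraint we do 
not obtain a closed form expression of the minimizer $\epsilon_{min}$. 
However, we identify a potential range of $\epsilon$ values that satisfy 
the equation 

$k + 1 = \left\lceil \frac{k}{1+\epsilon} \right\rceil + \left\lceil \frac{k}{r(1+\epsilon)} \right\rceil$. 
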

Lemma 3: An LRC with node repair locality r and distance 
$n - k + 1$ requires an additional storage overhead that is equal 
to $\epsilon_{min} = \frac{1}{r} - \delta_k$, where $\delta_k \in \left[0, \frac{r+1}{r} \cdot \frac{1}{k+1}\right]$. 

Proof: The proof can be found in the Appendix of the 
full version of the manuscript [17]. ■ 

Remark 3: We can always perform a line search within 
$\left[\frac{1}{r} - \frac{r+1}{r} \cdot \frac{1}{k+1}, \frac{1}{r}\right]$ to find the minimum $\epsilon$ that satisfies the 
erasure distance. 

In the following we present LRCs with repair locality r for 
$\epsilon = \frac{1}{r}$. In the proof of Lemma 3, we show that when $\epsilon = \frac{1}{r}$, 
then $d = n - k + 1$ is the maximum possible distance when 
$(r+1) \nmid k$. When $(r+1) \mid k$ the optimal distance is $d = n - k + 2$. 
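
The search for the minimum overhead can be done exactly, because the ceiling terms only change at values of $\epsilon$ where $k/(1+\epsilon)$ or $k/(r(1+\epsilon))$ is an integer. The sketch below is one possible implementation, not the paper's procedure. It reads the constraint as "the ceiling sum is at most k + 1", i.e., the bound permits distance at least n - k + 1; this is an assumption made so that a solution also exists when $(r+1) \mid k$. The function names are illustrative.

```python
from fractions import Fraction
from math import ceil

def ceiling_sum(k, r, eps):
    """ceil(k/(1+eps)) + ceil(k/(r(1+eps))), the left-hand side of the constraint."""
    return ceil(Fraction(k) / (1 + eps)) + ceil(Fraction(k) / (r * (1 + eps)))

def min_overhead(k, r):
    """Smallest eps >= 0 whose ceiling sum is at most k + 1 (exact search)."""
    # The ceiling sum only drops where k/(1+eps) or k/(r(1+eps)) is an integer j.
    breakpoints = {Fraction(k, j) - 1 for j in range(1, k + 1)}
    breakpoints |= {Fraction(k, r * j) - 1 for j in range(1, k + 1)}
    for eps in sorted(e for e in breakpoints if e >= 0):
        if ceiling_sum(k, r, eps) <= k + 1:
            return eps

print(min_overhead(4, 2))  # 1/3, which lies between 1/r - (r+1)/(r(k+1)) = 1/5 and 1/r = 1/2
print(min_overhead(6, 2))  # 1/2, i.e. exactly 1/r when (r + 1) divides k
```

For k = 4 and r = 2, the overhead 1/3 gives the ceiling sum 3 + 2 = 5 = k + 1, while any smaller overhead leaves the first ceiling at 4 and the sum at 6.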

B. Code Construction 

The codes that follow are optimal, i.e., LRCs, for all n, k, r, 
when $(r + 1) \nmid k$ and $(r + 1) \mid n$. 

Let a file x, of size $M = rk$, be subpacketized in r 
parts, $x = [x^{(1)} \dots x^{(r)}]$, with each $x^{(i)}$, $i \in [r]$, having 
size k. We encode each of the r file parts independently into 
coded vectors $y^{(i)}$ of length n, where $(r + 1) \mid n$, using an 
outer (n, k) MDS code: $y^{(1)} = x^{(1)} G, \dots, y^{(r)} = x^{(r)} G$, 
where G is the $k \times n$ MDS generator matrix. As MDS 
pre-codes, we use (n, k)-RS codes that require $\mathbb{F}_q$, with 
$q \ge n$. We generate a single parity sum vector from all 
the coded vectors, $s = \sum_{i=1}^{r} y^{(i)}$. This precoding process 
yields a total of rn coded blocks in the $y^{(i)}$ vectors and 
n parity blocks in s, i.e., an aggregate of (r + 1)n blocks 
available to place in n nodes. In our code, each node expends 
$\alpha = \frac{M}{k} + \frac{1}{r} \cdot \frac{M}{k} = r + 1$ (coded blocks) in storage capacity. 

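Below we state the circular placement of elements in nodes 
of the first (r + 1)-group: 

$\begin{array}{l|ccccc} 
 & \text{node } 1 & \text{node } 2 & \cdots & \text{node } r & \text{node } r+1 \\ \hline 
\text{blocks of } y^{(1)} & y^{(1)}_{1} & y^{(1)}_{2} & \cdots & y^{(1)}_{r} & y^{(1)}_{r+1} \\ 
\text{blocks of } y^{(2)} & y^{(2)}_{2} & y^{(2)}_{3} & \cdots & y^{(2)}_{r+1} & y^{(2)}_{1} \\ 
\vdots & \vdots & \vdots & & \vdots & \vdots \\ 
\text{blocks of } y^{(r)} & y^{(r)}_{r} & y^{(r)}_{r+1} & \cdots & y^{(r)}_{r-2} & y^{(r)}_{r-1} \\ 
\text{blocks of } s & s_{r+1} & s_{1} & \cdots & s_{r-1} & s_{r} 
\end{array}$ (10) 
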



There are three key properties that are obeyed by our 
placement: 

i) each node contains r coded blocks coming from different 
$y^{(i)}$ coded vectors and 1 additional parity element, 

ii) the blocks in the r + 1 nodes of the i-th (r + 1)-group 
have indices that appear only in that specific repair group, 
$i \in [\frac{n}{r+1}]$, and 

iii) the blocks of each "row" have indices that obey a 
circular pattern, i.e., the first row of elements has indices 
{1, 2, . . . , r + 1}, the second {2, 3, . . . , r + 1, 1}, and so on. 

We follow the same placement for all groups. In Fig. 3 
we show an LRC of the above construction with n = 6, 
k = 4 that has locality 2. 

C. Repairing Lost Nodes 

We will concentrate on the repair of lost nodes in the 
first repair group of r + 1 nodes. This is sufficient since the 
block placement is identical across repair groups. The key 
observation is that each node within a repair group stores r + 1 
blocks of distinct indices: the r + 1 blocks of a particular index 
are stored in r + 1 distinct nodes within the repair group. 
Hence, when for example the first node fails, then element 
$y_1^{(1)}$ is regenerated by downloading $s_1$ from the second node 
in the group, $y_1^{(r)}$ from the third, . . . , and $y_1^{(2)}$ from the last 
node in the group. A simple XOR of the above blocks suffices 
to reconstruct the lost block. Hence, for every lost block of 
a failed node, the r remaining blocks of the same index that 
are stored in the r remaining nodes of the repair group have 
to be XORed to regenerate what was lost. 

Fig. 3. A (6, 4)-LRC with locality 2 (panels: precoding and placement). 
Observe that we spent 3 blocks per coded element, instead of $\frac{M}{k} = 2$. 
This allows us to have easy repairability with locality 2. The distance of 
this code is guaranteed through the 2 (6, 4)-MDS precodes. 

Hence, a single block repair has locality r. Interestingly, the 
same applies to the repair locality of an entire node. For each 
lost block, r other coded blocks are used for regeneration, and 
all originate from the r remaining nodes. In Fig. 4 we show 
how repair is performed for the code construction presented 
in Fig. 3. 



Fig. 4. The repair of a node failure. The repair locality is 2 since 2 remaining 
nodes are involved in reconstructing the lost information of the first node, or 
the first coded element. Observe that we repair by only transferring blocks: 
no block combinations need to be performed at the sender nodes. 
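
The precoding, circular placement, and XOR repair described above can be sketched end to end for the (6, 4) example with locality 2. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses GF(8) with the primitive polynomial $x^3 + x + 1$ so that field addition is XOR, takes the RS precode to be polynomial evaluation at the points 1..n (which limits this sketch to n at most 7), and all function names and file contents are made up.

```python
from functools import reduce
from operator import xor

def gf8_mul(a, b):
    """Multiply in GF(8) = GF(2)[x]/(x^3 + x + 1); addition in this field is XOR."""
    res = 0
    while b:
        if b & 1:
            res ^= a
        b >>= 1
        a <<= 1
        if a & 0b1000:          # reduce modulo x^3 + x + 1
            a ^= 0b1011
    return res

def rs_encode(x, n):
    """Evaluate the polynomial with coefficients x at the points 1..n (Horner)."""
    out = []
    for point in range(1, n + 1):
        acc = 0
        for coeff in reversed(x):
            acc = gf8_mul(acc, point) ^ coeff
        out.append(acc)
    return out

def build_nodes(parts, n):
    """Return n dicts mapping (row, block index) -> block, using circular placement."""
    r = len(parts)
    rows = [rs_encode(x, n) for x in parts]
    rows.append([reduce(xor, col) for col in zip(*rows)])   # parity vector s
    nodes = []
    for node in range(n):
        base, j = node - node % (r + 1), node % (r + 1)
        nodes.append({(i, base + (j + i) % (r + 1)): rows[i][base + (j + i) % (r + 1)]
                      for i in range(r + 1)})
    return nodes

def repair(nodes, failed, r):
    """Rebuild a failed node from the r other nodes of its repair group."""
    base = failed - failed % (r + 1)
    helpers = [h for h in range(base, base + r + 1) if h != failed]
    xor_by_index, rows_seen = {}, {}
    for h in helpers:
        for (row, idx), block in nodes[h].items():
            xor_by_index[idx] = xor_by_index.get(idx, 0) ^ block
            rows_seen.setdefault(idx, set()).add(row)
    rebuilt = {}
    for idx, value in xor_by_index.items():
        (missing_row,) = set(range(r + 1)) - rows_seen[idx]
        rebuilt[(missing_row, idx)] = value
    return rebuilt, helpers

file_parts = [[1, 2, 3, 4], [5, 6, 7, 0]]   # r = 2 parts of k = 4 symbols (made-up data)
nodes = build_nodes(file_parts, n=6)
rebuilt, helpers = repair(nodes, failed=0, r=2)
print(rebuilt == nodes[0], helpers)          # True [1, 2]
```

The repair routine never reads the failed node's own contents, and it contacts only the r other nodes of the group, which is the locality property of the construction.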



D. Rate and Code Distance 

The effective coding rate of the codes presented in this 
section is 

$R = \frac{\text{size of useful information}}{\text{total storage spent}} = \frac{r \cdot k}{(r+1) \cdot n} = \frac{r}{r+1} \cdot \frac{k}{n}$. 

That is, the rate of the code is a fraction of the coding 
rate of an (n, k) MDS code, hence is always upper bounded 
by $\frac{r}{r+1}$. This is due to the extra storage overhead required to 
store the parity blocks $s_i$, $i \in [n]$. 

Remark 4: Observe that if we set the repair locality to $r = f(k)$, 
where f is a sub-linear function of k (e.g., $\log(k)$ or $\sqrt{k}$), 
then we obtain non-trivially low locality $r \ll k$, while the 
excess storage cost $\epsilon = \frac{1}{r}$ is vanishing when n, k grow. 
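
A short numerical sketch of this trade-off follows. It assumes that each node stores r + 1 blocks, so that the rate is $rk/((r+1)n)$, and uses $r = \lfloor \sqrt{k} \rfloor$ as the sub-linear locality. The code length n = 2k is a hypothetical choice made only for illustration and is not required to satisfy $(r+1) \mid n$ here.

```python
from fractions import Fraction
from math import isqrt

def lrc_rate(n, k, r):
    """Rate r*k / ((r+1)*n) of a code storing r coded blocks and 1 parity block per node."""
    return Fraction(r * k, (r + 1) * n)

for k in (16, 256, 4096):
    r = isqrt(k)                 # locality r = sqrt(k), one sub-linear choice
    n = 2 * k                    # hypothetical length, for illustration only
    overhead = Fraction(1, r)    # excess storage eps = 1/r
    print(k, r, overhead, lrc_rate(n, k, r), Fraction(k, n))
```

As k grows, the printed overhead shrinks and the rate approaches the MDS rate k/n from below.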

The distance of the presented code is (at least) $d = n - k + 1$ 
due to the MDS precodes that are used in its design: any k 
nodes in the system contain rk distinct coded blocks, k from 
each coded vector $y^{(i)}$. Hence, performing erasure decoding on these 
r k-tuples of blocks we can generate the r pieces of the file. 